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Abstract Biological structure and function depend on complex 
regulatory interactions between many genes. A wealth of gene expres- 
sion data is available from high-throughput genome-wide measure- 
ment technologies, but effective gene regulatory network inference 
methods are still needed. Model-based methods founded on quanti- 
tative descriptions of gene regulation are among the most promising, 
but many such methods still rely on ad hoc inference approaches and 
lack experimental interpretability. We propose an experimental de- 
sign and develop an associated statistical method for learning a quan- 
titative, interpret able, predictive, biophysics-based ordinary differen- 
tial equation model for gene regulation. We fit the model parameters 
using gene expression measurements from perturbed steady-states of 
the system, like those following overexpression or knockdown exper- 
iments. Although the original model is nonlinear, our design allows 
us to transform it into a convex optimization problem by restricting 
attention to steady-states and using the lasso for parameter selec- 
tion. Here, we describe the model and inference algorithm and apply 
them to a synthetic six-gene system, demonstrating that the model is 
detailed and flexible enough to account for activation and repression 
as well as synergistic and self-regulation, and that the algorithm can 
efficiently and accurately recover the parameters used to generate the 
data. 

Introduction. Complex interactions between many genes give rise to biological structure and 
function that sustain life. The Central Dogma (Jacob and Monod, 1961; Crick, 1970) provides 
a qualitative description of how these processes occur, but precise quantitative modeling is still 
needed (Tyson, Chen and Novak, 2003; Rosenfeld, 2011). Research into the detailed mechanisms 
of gene expression over the past few decades has shown that expression is regulated by a complex 
system of gene interactions. Recently, microarray and sequencing technologies (DeRisi, Iyer and 
Brown, 1997; Ren et al., 2000; Robertson et al., 2007; Mortazavi et al., 2008) have enabled high- 
throughput genome-wide expression level measurements. This data enables detailed study of gene 
networks (Holstege et al., 1998; Lee et al., 2002; Tegner et al., 2003; Segal et al., 2003; Bar- Joseph 
et al., 2003; Hu, Killion and Iyer, 2007; Zhou et al., 2007). The goal is to understand how genes 
interact to give rise to the biochemical complexity that allows organisms to live, grow and reproduce. 

Gene expression measurements contain information useful for reconstructing the underlying inter- 
action structure (DeRisi, Iyer and Brown, 1997; Holstege et al., 1998; Hughes et al., 2000) because 
gene regulatory systems have a defined ordering (Avery and Wasserman, 1992), forming pathways 
that ultimately connect to form networks (Alon, 2007; De Smet and Marchal, 2010). At the turn of 
the century, researchers began applying statistical tools to genome- wide expression data to under- 
stand complex gene interactions. Eisen et al showed that genes from the same pathways and with 
similar functions cluster together by expression pattern (Eisen et al., 1998). Soon after, module- 
based network inference methods appeared, in which co-expressed genes are grouped into the same 
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module of cellular function (Segal et al., 2003; Bar- Joseph et al., 2003). More recently, methods 
based on descriptive but non-mechanistic mathematical models (Gardner et al., 2003; Tegner et al., 
2003; Bansal et al., 2007; Faith et al., 2007; Friedman, 2004) have gained prominence. Model-based 
methods provide a quantitative description of gene regulation and can be used to predict the future 
behavior of the system. However, most of these methods still rely on ad hoc inference approaches 
that depend on various assumptions to find interactions between genes without learning experi- 
mentally interpretable parameters. In this paper, we extend the current model-based literature by 
proposing a statistical method for inferring a mathematical model for gene regulation. 

Our method is based on a quantitative, experimentally interpretable biophysics-based ordinary 
differential equation (ODE) model of gene regulation. We focus on transcriptional regulation since 
transcription lies at the core of gene expression regulation (Holstege et al., 1998; Hu, Killion and 
Iyer, 2007). A standard biophysics approach to gene transcription uses a thermodynamic model of 
transcription factor (TF) binding to the gene promoter. TF binding can either activate or repress 
transcription. Many models of this type have been proposed, but we believe that the Bintu et al 
model is standard. The idea traces back to the beginning of systems biology in the biophysics field 
(Ackers, Johnson and Shea, 1982; Shea and Ackers, 1985; von Hippel et al., 1974), although it has 
been used only for modeling and not for reconstruction. 

The dynamical model described in Bintu et al is a rich, flexible and interpretable physics-based 
model. Their ODE-based model of gene expression is based on the thermodynamics of transcrip- 
tion factor and RNA polymerase binding to gene promoters. The form of the model is flexible 
enough to capture the full range of gene regulatory behavior in quantitative detail, and the pa- 
rameters are physically interpretable. Another notable thermodynamic model is that of the annual 
DREAM competition, but it has many biochemical assumptions and model parameters, like the 
Hill coefficient of transcription factor binding events, that cannot be estimated using gene expres- 
sion measurements, so the network reconstruction requires ad hoc inference methods to learn the 
underlying gene interactions (Yip et al., 2010; Pinna, Soranzo and de la Fuente, 2010; Marbach 
et al., 2010; Schaffter, Marbach and Floreano, 2011). Thus, we choose to base our approach on the 
Bintu model since it is simple, captures the mechanism of transcription well, and lends itself to 
statistical network inference. 

We have developed an approach to learn gene network structure by inducing perturbed steady- 
states and fitting the parameters in the thermodynamic model of Bintu et al using a set of gene 
expression measurements. Although the original inference problem is nonlinear, we can transform 
it into a convex optimization problem by restricting our attention to steady-states. We use the 
lasso (Tibshirani, 1996) for parameter selection. As a proof of principle, we test the method on 
a simulated embryonic stem cell (ESC) transcription network (Chickarmane and Peterson, 2008) 
given by a system of ODEs based on the Bintu et al model. Here, we demonstrate that the inference 
algorithm is computationally efficient, accounts for synergistic regulation and self-regulation, and 
recovers the correct parameters used to generate the data. Furthermore, the method only requires 
a set of steady-state gene expression measurements. Experimental researchers in the biological sci- 
ences can use this method to infer gene networks in a much more principled, detailed manner than 
earlier approaches allowed. 

Dynamical Systems Model We model gene expression regulation as a dynamical system. Let 
x E W 1 represent RNA concentrations and y E W 1 represent protein concentrations corresponding 
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to a set of n genes. We assume that the production rate of the RNA transcript X{ of gene i 
is proportional to the probability that RNA polymerase (RNAP) is bound to the promoter, the 
production rate of protein product yi of gene i is proportional to the concentration of the RNA 
transcript Xi, and that both the RNA transcript and protein products of gene i degrade at fixed 
rates (X^ NA , \P rotein ^ . The probability that RNA polymerase binds to the promoter is modeled 
as a nonlinear function / of y, since RNAP binding is regulated by a set of transcription factors. 

^ = nfi(y) - \f NA Xi 
(i) ^ = r iXi - \ p r uin yi 

Based on the thermodynamics of RNAP and TF binding, we assume the following form for fi 
(Bintu et al., 2005b,a): 

, 9 x r , n _ ^o + E^i bijU keSij Xk 

\ Z ) Ji{y) — -i , v^m „ -,-r 

1 + l^j=l CijUkeSijUk 

where Sij lists the gene products that interact to form a regulatory complex, and bij,Cij are non- 
negative coefficients that must satisfy Cij > bij > 0. The coefficients b{j and Cij depend on the 
binding energies of regulator complexes to the promoter, bio and qq correspond to the no-regulator 
case (HktSioVk — 1) 5 an d the coefficients are normalized so that qo = 1- Details and a derivation 
are given in Appendix A. 

The form of fi allows us to model the full spectrum of regulatory behavior in quantitative detail. 
Terms that appear in the denominator only are repressors, and the degree of repression depends 
on the magnitude of the coefficient, while terms that appear in the numerator and denominator 
may act as either activators or repressors depending on the relative magnitudes of the coefficients 
and the current gene expression levels. Terms may represent either single genes or gene complexes. 
The model can even be extended to account for environmental factors that affect gene regulation, 
though we will not discuss it further here. 

As an example, consider the simple two-gene network shown in Figure 1. Suppose that genes 1 
and 2 have RNA concentrations xi,x 2 , and protein concentrations y\ 1 y 2l respectively, and that 
gene 1 is activated by protein 2 and repressed by its own product (protein 1), while gene 2 is 
repressed by a complex formed by proteins 1 and 2. The situation corresponds to the following 
equations: 

dxi bio + bnX 2 \RNA d Vl „ „ x Protein ol 

at 1 + c\\X2 + c\2X\ at 

( o\ dx 2 _ ^20 \RNA„ dy 2 _ .Protein^ 

at 1 + C21X1X2 at 

In the notation above, we have Su = {2}, S12 = {1}, S21 = {1, 2}. The parameters 610, 611, en, . . . 
determine the magnitude of the the repression or activation. As this example shows, the model is 
flexible enough to capture a wide range of effects, including self-regulation (that is, regulation of 
a gene by its own protein product, most commonly as repression) and synergistic regulation by 
protein complexes (two or more proteins bound together to form a regulatory unit), in quantitative 
detail. Furthermore, the model is predictive: if we know or can infer the coefficients in the model, 
we can predict the future behavior of the system starting from any initial condition. 
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Figure 1: Simple two- gene network example described by equation 3 (with parameters bn = en = 0.1 is 
typical for activators, c\2 — 10 for repressors, and b\§ = 0.01 for constants in the numerator). Gene 1 is 
activated by the protein product of gene 2 and repressed by its own product (an example of self-regulation). 
Gene 2 is repressed by a complex formed by the product of gene 1 and its own product (synergistic self- 
regulation). In the diagram, the edge colors indicate activation (green) or repression (red) and the edge 
weights indicate coefficient sizes, with typical sizes shown in the figure. 



Inference problem. The model given by equations 1 and 2 fully describes the evolution of RNA 
and protein levels and provides a comprehensive, quantitative model of gene regulation, provided 
we know the parameters. Unfortunately, 6^-, Cij and extremely difficult to measure, as they depend 
on binding energies of RNAP and TFs to the gene promoter. The sheer number of measurements 
required to characterize all possible TFs (both individual proteins and complexes) also makes this 
approach infeasible. Therefore, our goal is to use a systems level approach to fit the model using 
RNA expression data. Specifically, we will assume that Ti 1 \f NA 1 \f rotein are known or can be 
measured (alternatively, we can simply absorb these terms into the coefficients b{j,Cij and ignore 
them). Our data will be measurements of the RNA concentrations x at many different cellular 
steady states (which correspond to steady-states of the dynamical system). The problem is to infer 
the values of the coefficients 6^-, c^-. 

Linear constraints at steady-state. The key to solving this problem efficiently is to restrict our 
attention to steady-states, as proposed by Choi (Choi, 2012). This allows us to transform a nonlin- 
ear ODE fitting problem into a linear algebra problem. A steady-state of the system is one in which 
RNA and protein levels are constant: ^ = ^ = 0. Steady-states of the system correspond to cell 
states with roughly constant gene expression levels, like embyronic stem cell, skin cell or liver cell. 
In contrast, an embryonic stem cell in the process of differentiating is not in steady state. Perturbed 
steady-states are particularly interesting. After a perturbation like gene knockdown, a cell's gene 
expression levels are in flux for some time while they adjust to the change. Eventually, if it is still 
viable, the cell may settle to a new steady-state. These perturbed steady-state are especially helpful 
for understanding gene regulation. 

In our model, the steady-state conditions ^ = ^ = mean that: 

= ti/, (y) - \f NA x u = r lXl - Af***» W => y„ - n%i 



n 



Af 



'rotein 



Defining /;(» = /,( A ^ tein z) yields 



= nfiix) - \f NA Xi 

the coeffi 

we obtain the final equation 



Absorbing the constants into the coefficients bij, Cij, (so that bij — bijUkeSi ■ X Protein •> <kj — c ij^-keS, — — 



ij \ Protein 1 ^11 - L - L /cGOi 7 - \Protei 



ho + Ysj bijU keStJ Xk 

n : . ^ ^ liXi = 0, 
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or 



Ti(biO + ^2 b ij u keS ZJ Xk) - 7iXi(l + ^2 C iJ U keS ZJ Xk) = 0, 



(by multiplying both sides by the denominator). The last equation is linear in the coefficients bij,Cij\ 
In order to solve for fr^c^-, we will need to collect many different expression measurements x at 
both naturally occurring and perturbed steady-states. Each steady-state measurement will lead to 
a different linear equation. These can be arranged into a linear system that we can solve for the 
coefficients. 

Problem formulation. Our problem is to find 6^-, c^- such that 

= Ti(bi + ^2b ij U keSij x^) - jiXi(l + ^CijU keSij x^), Vra = 1, . . . ,M, 
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given RNA expression data at many different steady-state points m = 1, . . . , M and known 
translation and degradation rates Ti,\f NA ,\f rotein . We solve a separate problem for each gene z, 
since the coefficients in the differential equation dxi/dt = ... for gene i are independent of 

the coefficients in the differential equations for other genes. Since we cannot know ahead of time 
which potential regulatory terms Hk^s^Xk are actually involved, we include all possible terms up 
second-order and look for sparse 6^,c^-, intepreting c^- = to mean that term H^SijXk is not a 
regulator of gene i. 

Consider gene 2 in the two-gene example. Suppose we have expression measurements for a naturally 
occurring steady state x%), and a perturbed steady-state following gene 1-knockout (0, x\). We 
obtain two linear equations in the coefficients 620, C21: 

T&20 — ^2^2(1 + 021x^X2) = 0, (steady-state (x^x®)) 
7-620 — A2X2 = 0, (steady-state (0, x\)) 
If we knew a priori that complex x\X2 was the only regulator of gene 2, these two equations would 
allow us to solve for the coefficients (620 — C21 = x ? f 2 2 )• Typically we do not know the 

regulators beforehand, however, and we need to use the data to identify them. That is, we include 
all possible terms (up to second order) in the equation: 

t(&20 + ^21^1^2 + ^22^1 + ^23^2) - A 2 X 2 (1 + C21X1X2 + C 2 2^1 + C 2 2^2) = 0. 

and estimate sparse coefficients hj^Cij using several steady-state measurements (x^™\x^). (We 
should find that the recovered coefficients 621, 622, 623, C22, C23 are very close to zero, since the cor- 
responding terms do not appear in the true equation.) 

We can compactly express the general system above by defining Z{ as the vector with entries 
Zi(j) = HkeSijXk (with the convention that Z{(0) — 1, Z{(j) = Xj for j = 1, . . . , n), which yields 

= nbfzi - jiZi(i)cfzi, Vra = 1, . . . , M. 

If we form a matrix Gi by concatenating the row vectors z^\ . . . , z\ M ^ and let D\ be a diagonal 
matrix with entries z^ m \i), m = 1, . . . , M, we can express this as 



[Gi -TAG,] 
with the constraints < b\ < Cj, q(0) = 1 



= 0. 



Algorithm. We need to solve the linear system 



[d -jDiGi] \ bi ] = 0. 

for bi,Ci, subject to the constraints < bi < Ci, q(0) = 1. In order to account for measurement 
noise and encourage sparsity in bi, Ci (since we know that each gene has only a few regulators), we 
will minimize the ^2-norm error with l\ regularization (Tibshirani, 1996). This leads to the convex 
optimization problem 



minimize || [Gi —^DiGi 



b + Adl^Hi + llQUi) 



(4) subject to < 6j < Cj, q(0) = 1, 

where A is a parameter controlling sparsity that we can choose using cross validation. Since the 
problem is convex, it can be solved very efficiently even for large values of n and m. 

Nonidentifiability. Our model's ability to capture self-regulation is very powerful, but it also 
leads to a particular form of nonidentifiability. For certain forms of the equation, given only steady- 
state measurements, it can be impossible to determine whether self-regulation is either completely 
absent or present in every term. Specifically, any valid equation of the form: 



dxi ho + J2jLi bijH-keSijXk 
1 + J2jLi Cij^-keSij Xk 



(5) — = — ^ jiX h b i0 <l 



is indistinguishable at steady-state from any member of the following family of valid equations 
indexed by the constant c: 



dxi (cbio+i^Xi + Y^icbijUkeSijXiXk 7 

(6) — — = j^- 7i#i, c > — 

dt 1 + cxi + Y,j=i ccijU keStj XiX k 1 - 

We will refer to these as the 'simple' and 'higher-order' forms of the equation, respectively. The 
short proof of their equivalence is given in Appendix B. The condition c > t^— guarantees that 
c > and < cbiQ + 7^ < c (since < 6^0 < 1) and < cbij < ccij (since < bij < Cij). 

We can distinguish between these two alternative forms by measuring the derivative of the con- 
centration away from steady-state and comparing it to the derivative predicted by each form of 
the equation. This requires only a few extra thoughtfully-selected measurements. The details are 
in Appendix C. 

Simulated six-gene subnetwork in mouse ESC. To demonstrate the inference approach, we apply 
our method to a synthetic six- gene system based on the Oct4, Sox2, Nanog, Cdx2, Gcnf, Gata6 
subnetwork in mouse embryonic stem cell (ESC). Chickarmane et al developed this system based 
on a synthesis of knowledge about ESC gene regulation accumulated over the past two decades 
(Chickarmane and Peterson, 2008). The network structure is shown in Figure 3a, and the detailed 
model is given by the following system of ODEs in the six genes: 
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d[0] 0.001 + [A] + 0.005[O][S] + 0M5[O}[S}[N] 

dt ~ l + [A] + 0.001[O] + 0.005[O][5] + 0.025[O][5][A^] + 10[O][C] + 10[Gc] ' [ J 
d[S] _ 0.001 + 0.005 [O] [S] + 0.025 [O] [S] [N] 



dt l + 0.001[O] + 0.005[O][S] + 0.025[O][S][./V] 
d[N] _ 0.001 + 0. 1 [O] [S] + 0. 1 [O] [S] [N] 



dt 1 + 0.001 [0] + 0.1[O][S] + 0.1[O][5][JV] + 1O[O][0] 
d[C] _ 0.001 + 2[C] 



0.1[5] 

0.1[N] 



dt l + 2[0] + 5[O][0] 
d[Gc] _ 0.001 + 0.1[C] +0.1[G] 
dt ~~ 1 + 0.1[G] + 0.1[G] 



0.1[G] 

0.1 [Gc] 



d[G}^ 0.1 + [O] + 0.00025[G] 
{ ' dt 1 + [O] + 0.00025[G] + 15[N] 1 J 

This model has many of the same qualitative characteristics as the biological mouse ESC net- 
work (Chickarmane and Peterson, 2008). In particular, the system can support the four different 
steady-states: embryonic stem cell (ESC), differentiated stem cell (DSC), endoderm and trophec- 
toderm, and can switch between them when certain genes' expression levels are changed. In the 
Oct4 equation, A represents an external activating factor, whose concentration [A] depends on the 
culture condition. Each of the four steady-states has a corresponding value of [A]: 10 for ESC and 
DSC, 25 for endoderm, and 1 for trophectoderm. For the remainder of this paper, we will regard 
[A] as known. The explicit system ODEs (7) allows us to generate data to fit our model and also 
to quantitatively compare our recovered solution to the ground truth. The qualitative similarity 
of this synthetic network to a real biological network gives us confidence that the results of this 
experiment are likely to translate well to real biological networks. 



We observe that the Cdx2, Gcnf and Gata6 equations have alternative forms (provided we ignore 
constant term : 
of c, the alterna 

d[C] 0.95 



the very small constant term in the equation and [G] term in the ^gp) • With the minimum 
possible value of c, the alternative forms are: 



dt ~ 1 + 2.5[0] (C ~ 2 ^ 
d[Gc] _ 0.1001[Gc] + 0.01 [C][Gc] + 0.01[Gc][G] 
dt ~ 1 + 0.1 [Gc] + 0.01 [C][Gc] + 0M[Gc][G] 



0.1 [Gc] (c = 0.1) 



(H) d[G] - 0-lll[G]+0.111[O][G] r r -nilD 

[ ) dt l + 0.111[G] + 0.111[O][G] + 1.67[JV][G] [ J [ ' 

To resolve this, we will apply our method twice, once allowing self-regulation and again disallow- 
ing it. Then we will compare the two recovered forms of each equation and quality of the fits to 
determine whether nonidentiflability exists in each case. If so, we will break the tie by examining 
derivatives. 



To fit the model, we collect data on the the expression levels of all six genes at many different 
steady-states. First we measure the expression levels at all four wildtype steady states: SC, DSC, 
endoderm and trophectoderm. We also induce additional perturbed steady-states by simulating 
knockdowns and knockups of each gene. For a knockdown, we hold a gene at one fifth of its steady- 
state expression level; for a knockup we hold a gene at a twice its steady-state level. In each case 
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Expression trajectories following Oct4 knockdown from SC 




time 

Figure 2: Gene expression trajectories during an Oct4 knockdown from SC steady-state. The expression 
of Oct4 is artificially reduced to 20% of its SC steady-state expression level and held there, causing the 
expression levels of the targets of Oct4 to change in response, which in turn impact their targets. The system 
eventually reaches a new steady-state different from SC. We measure the vector of expression levels at the 
new steady-state and use it as data in the inference algorithm. Since Oct4 is knocked down, this induced 
steady state does not provide useful information about the Oct4 equation, but it is useful for understanding 
the role of Oct4 and other genes in the equations of the remaining five genes. 



we wait for the system to settle to a new steady-state, then measure the expression levels. Figure 2 
shows the expression trajectories during Oct4 knockdown from the ESC steady-state as an example. 

The details of the simulation are given in Appendix D. We begin by testing the algorithm on 
noiseless data. We solve the optimization problem 4 once, then we solve it again with additional 
constraints prohibiting self-regulation. In each case, we use cross validation to select the sparsity 
parameter A (Figure 5). The quality of the fit is comparable for the latter three equations whether 
we restrict self-regulation or not, while for the first three equations restricting self-regulation has 
a significant negative impact on the fit (Table 1), indicating that the first three equations are 
unambiguous while the last three have two possible forms. To resolve the nonidentifiability in the 
latter three equations, we measure the derivatives of Cdx3, Gcnf and Gata6 immediately after some 
additional informative perturbations: Oct4, Cdx2 and Nanog knockouts, respectively (Figure 6). 
The test reveals that that Gcnf and Gata6 have the simple form, while Cdx2 has a higher-order 
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form. This procedure allows for near-perfect recovery: 

d[Q] = 0.001 + [A] + (0.005[Q][5] + 0.025[O][S][AT]) 

dt l + [A] + (0.001[O] + 0.005[O][5] + 0.025[O][5][iV]) + 10[O][C7] + 10[Gc] ' [ J 

d[S] = 0.001 + 0.005[O][S] + 0.025[O][S}[N} 

dt ~ l + 0.005[O][5] + 0.025 [O] [5] [AT] ' [ J 

d[N] _ 0.1[O][5] + 0.1[O][g][iV] 

dt l + 0.1[O][5] + 0.1[O][5][iV] + 10[O][G] 1 J 



~dT ~ 1 + 2[C] + 5[0][C] 
d[Gc] _ 0.001 + 0.1 [C] + 0.1[G] 
~ 1 + 0.1[C] + 0.1[G] 



-0.1[Gc] 



f9l 0.1 + [Q] 

1 ; l + [O] + 0.03[AT][Gc] + 15[iV] L J 

Next we add zero-mean Gaussian noise to each measurement, with standard deviation 1% of the 
measurement magnitude. We use the same steady-states as in the noiseless case, plus knockup- 
knockdown of each pair of genes starting from ESC and DSC. Using a similar approach (detailed 
in Appendix D), we recover: 



m = \3 O .i[o] 

dt l + [A]+9.9[Gc]+9.9[0][C\ 1 J 
d[S] _ 0.001 [O] [S] + 0.0005[5][AT] + 0.025[O][5][JV] 



dt 1 + 0.001 [O] [S] + 0.0005[5][AT] + 0.025[O][5][AT] 
d[N] 0.09[O][S][N] 



0A[S] 



dt 1 + 0A[G][Gc] + 0.09[O][5][AT] + 9.1[0][G] 

d w . m .i[ci 

0A[Gc] 



0.1 [AT] 



dt 


~~ 1 + 2[C] + 5[0][C] 


d[Gc] 


0.1 [C] +0A[G] 


dt 


~~ 1 + 0A[C] + 0A[G] 


d[G] 


0.1 + 0.9[O] 


dt 


1 + 0.9[O] + 14.2[AT] 



(10) -V = l + 0.9[ O ] + l4.2 [ Ar] -°-" G ' 

In order to produce clean equations and network diagrams, we choose appropriate thresholds for 
each equation below which we zero the coefficients. We set the thresholds at 0.1% (noiseless case) 
or 1% (noisy case) of the largest coefficient recovered for each equation. For example, the largest 
recovered coefficient in the d[G]/dt equation is roughly 15 in either case, so we zero the coefficients 
that fall below 0.015 (noiseless case) or 0.15 (noisy case). The recovered systems of equations shown 
above reflect these choices. In the noiseless case, relaxing the threshold on the Oct4 equation to 
0.01% leads to the recovery of more correct terms, listed in parentheses. For completeness, we also 
provide receiver operating characteristic (ROC) curves in Figure 4 to show the tradeoff between 
true positives and false positives at other thresholds. The network diagrams in Figure 3 (b,c) include 
an edge if the corresponding coefficient is above the threshold, with weights reflecting the size of 
the coefficients. These diagrams show that the recovery is nearly perfect in the noiseless case: using 
the gentler threshold for the Oct4 equation we recover all the true edges except for three very weak 
ones, and return just one small false positive repressor in the Gata6 equation. In the noisy case, 
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Figure 3: Recovery of a synthetic gene regulatory network based on the biological ESC network using our 
inference algorithm. The diagrams represent sytems of ODEs that quantitatively model the gene interac- 
tions. Edge color indicates activation (green) or repression (red), and edge weights correspond to coefficient 
magnitudes. The arrows point from regulator to target, and self-loops indicate self-regulation. The yellow 
star represents the third-order complex OSN. (In addition to all possible first- and second-order terms, we 
allow this special third-order term with a free coefficient.) The left figure represents the original system of 
ODEs used to generate the data. The center figure shows the network recovered using our inference algorithm 
on noiseless data, and the right figure shows the recovery with 1% noise added. Both recovered networks 
reflect coefficient thresholding at 0.1% (noiseless case) or 1% (noisy case) of the largest recovered coefhcent 
in each gene equation (with the exception of the noiseless-case Oct4 equation, thresholded at 0.01% to show 
the sucessful recovery of weak edges). The algorithm performs almost perfectly in the noiseless case, except 
for a false positive repressor on Gata6 and three very weak activation edges missing. In the noisy case, the 
algorithm recovers all of the strong edges, but misses some of the weaker ones and returns a few small false 
positives at our chosen thresholding level. Overall, the method captures the major network structure even 
in the noisy case. 



we recover all the large coefficients correctly, although there are a few small false positives and the 
we miss several of the weakest edges. Overall, the method successfully captures the major network 
structure even in the presence of low-level noise. 

Discussion. Our experiment on the synthetic ESC system demonstrates that our algorithm can 
accurately recover the coefficients in a complex dynamical systems model of gene regulation and 
that the method can tolerate low levels of noise. Term selection from among all possible single gene 
and gene-complex regulators (up to second-degree interactions, plus the third-degree interaction 
OSN) was successful. The inferred equations are easy to interpret in terms of gene networks, and 
the detailed quantitative information allows for prediction of future expression trajectories from 
any starting point. 

The approach is also scaleable. Since we have formulated our problem as a convex optimization 
problem (Equation 4), it can be solved efficiently even for large systems using prepackaged software. 
Furthermore, it is trivially parallelizable, since we need to solve a version of Equation 4 to infer the 
differential equation coefficients bij,Cij for each gene i. Parallelization is even more helpful for the 
cross validation step, where we need to solve Equation 4 for each gene and a sequence of choices 
sparsity parameter A. We tested the scaleability by running the algorithm with the parallelization 
discussed above on a simulated 100-gene system. The algorithm ran correctly in a reasonable time 
frame on a computing cluster. 

The high resolution of our model is one of its most valuable features, but it means that accu- 
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Figure 4: ROC curves for recovered networks from noiseless (right) and noisy (left) data showing the tradeoff 
between true positive rate (TPR) and false positive rate (FPR) for edge recovery. The ROC curves show the 
TPR and FPR that result from a range of coefficient threshold choices above which we consider an edge to 
have been recovered. For the equation dxi/dt = . . . and threshold t, TPR is defined as the proportion of true 
edges j with c ^ covered > t and FPR as the proportion of false edges with Recovered > ^ p or e q Ua ti ns with 
two possible forms, we compare the simple forms of the true and recovered equations. Each gene equation 
has a different ROC curve as indicated by the legend. The dotted black line is the expected ROC curve for 
'random guessing' algorithm, while the (0, 1) point corresponds to a perfect algorithm (in fact, our algorithm 
performs perfectly for the Gcnf equation). 



rate term selection may require lots of data, especially in the presence of noise. In our experiment, 
when we added 1% Gaussian noise, we needed extra data (knockdown/knockup pairs) in order 
to accurately select terms. When we tried 5% noise, the algorithm consistently selected the large 
terms in five of the six equations, but we had to add even more data in order to correctly idenify 
the major repressor in the Nanog equation. The Nanog equation is subtle in that Oct4 acts as 
both an activator in complexes with Sox2 and Nanog and a repressor in a complex with Gata6, 
so the algorithm tends to select different Gata6 complexes (or the Gata6 singleton) as the major 
repressor when the data is insufficient. In the 5% noise case, we needed additional data on the 
role of Gata6 (double-knockdowns and double-knockups of pairs including Gata6 from ESC and 
DSC) in order to select Oct4-Gata6 as the major repressor of Nanog fairly consistently. Another 
difficulty we have already discussed is the nonidentifiability that arises from accounting for self- 
regulation while restricting data to steady-states. Distinguishing between the two possible forms of 
nonidentifiable equations requires extra derivative data (which is harder to collect experimentally) 
and extra steps in the algorithm. The constraints on the convex optimization problem (4), which 
arise from thermodynamic considerations, are sufficient to prevent further nonidentifiability, but in 
certain cases, certain problems can suffer from near-nonidentifiability of other forms, which may 
contribute to the challenge of term-selection with noisy or limited data. We ensure accurate term 
selection by making sure we include enough diverse, high-quality steady-state measurements. 

We should also note that our model does not account for the intrinsic noise in gene transcription 
and translation, although these processes are inherently stochastic, since TF and RNAP binding re- 
sult from chance collisions between molecules in the cell. However, the the stochastic version of our 
rational-form transcription model is highly complex and there is currently no satisfactory method 
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for its inference. Studying the deterministic evolution plus additive noise is standard practice for 
all but linear models of gene expression, and treatment of the deterministic model provides insight 
into the stochastic model. Here we focus on the additive noise case and leave the study of intrinsic 
noise for future investigation. 

Conclusions. Our model is based on the detailed thermodynamics of gene transcription, and 
quantitatively captures the full spectrum of regulatory phenomena in a detailed, physically in- 
terpretable, predictive manner. Since we can formulate the model fitting problem as a convex 
optimization problem, we can solve it efficiently and scaleably using prepackaged software. i\- 
regularization allows for term-selection while maintaining the problem convexity. The experiments 
required to collect the necessary steady-state gene expression data are straightforward to perform, 
as technologies for knockdowns and knockups are well-established and measuring gene expression is 
relatively simple. The model accounts for activation and repression by single-protein TFs and syn- 
ergistic complexes as well as self-regulation, and describes the magnitude of each type of regulation 
in quantitative detail. Furthermore, the model can be extended to account for environmental effects 
and auxiliary proteins involved in regulation, including enhancers and chromatin remodelers. The 
fitted model can accurately predict the evolution of the system from any starting point. Given a set 
of steady-states gene expression measurements, our algorithm can be used to fit a model which not 
only predicts further steady-states of the system, but also fully describes the transitions between 
them. The resulting quantitative model provides unprecedented insight into and control over the 
entire regulatory lifecycle of the cell. 



Appendices 



APPENDIX A: THERMODYNAMIC MODEL 

In Equation 1, the function fi(y) represents the probability that RNAP binds to the ith gene 
promoter. We claim that fi(y) has the form: 

m - p£Ld(v) - ^ 



Zj(l + e-^ P)e-^U keSij y k 

where Ae^ is the binding energy of the jth complex to the promoter, Ae^ NAP is the binding energy 
of RNAP to the jth promoter-bound complex, and P, xj are the concentrations of RNAP and gene 
product j (Bintu et al., 2005a, b). 



Any type of regulator (including no regulator at all) can be represented in this framework. For 
no regulator, we take Sij = with the convention that Hke^Vk = 1, set Ae^ = 0, and take Ae^ NAP 
as the base binding energy of RNAP to the promoter. For a repressor, Ae^ < and Ae^ NAP > 0; 
for an activator, Ae^- < and Ae^ NAP < 0. 

Setting 

Cij = (i + e -^HNAP p)e -^ j 
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we obtain the form given in Section 1: 



fi{y) = 



Constant terms in the numerator and denominator correspond to the no-regulator case. Letting 
Cio denote the constant appearing in the denominator, our convention will be to divide all of 
the coefficients in the numerator and denominator by Qo so that the constant 1 appears in the 
denominator. 

Simplified derivation. The derivation we present here follows Bintu et al and Garcia et al 
(Bintu et al., 2005a,b; Garcia et al., 2011). For simplicity, we will prove the following claim for the 
simplified case with one regulator y\ (as well as the possibility of RNAP binding with no regulator): 

(i) e -/?A £ fNAP p + e - P A e ™*Pp e -{}Ae ilyi 

Abound " (1 + e _ Me RNAP jP) + (1 + e - M£ H N AP p)e _^ Aeayi 

We will use the following notation: Cp a is the energy of the state in which RNAP is specifically 
bound to the regulator-promoter complex, 6p i0 is the energy of the state in which RNAP is specif- 
ically bound to the promoter without the regulator, 6p S is the energy when RNAP is bound to a 
nonspecific binding site, is the energy when y\ is specifically bound to the promoter, and is 
energy when y\ is bound to a nonspecific binding site. Then 



a RNAP a _ S NS a RNAP a _ S NS a _ S NS 

Suppose that we have p RNA polymerase molecules and r molecules of gene product 1 (the reg- 
ulator). We model the genome as a "reservoir" with n nonspecific binding sites (to which either 
RNAP or regulator can bind). One of these sites is the promoter of gene i. Three different classes 
of configurations interest us: 

1. empty promoter 

2. regulator bound to promoter 

3. regulator and RNAP bound to promoter 

4. RNAP only bound to promoter 

These correspond to the following partial partition functions, which represent the "unnormalized 
probabilities" of each configuration. 

1. Z(p,r) 

2. Z(p,r-l)e"^ c ?i 

3. Z{p - 1, r - l)e-^e~^P^ 

4. Z{p- l,r)e -/3c ^o 



where Z(p.r) = , u n! V] e P re ii S e P € p^ 



p\r\(n— p— r)\ 

Zip, r) is equal to the total number of arragements of RNAP and y\ on the nonspecific bind- 
ing sites times the Boltzmann factor, which gives the relative probability e~^ e of a particular state 
in terms of its energy e. 
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Since RNAP binds the promoter only in the third and fourth classes of configurations, the prob- 
ability that RNAP binds the promoter is equal to the unnormalized probability of the third and 
fourth configurations divided by the "total probability" (the sum of the unnormalized probabilities 
of all classes of configurations). Hence 

Zip - 1, r)e _/3e ^o + Zip - 1, r - l)e-^e~ p ^ 

Abound 



Z{p, r) + Z{p - 1, r)e _/3e ^o + Zip, r - l)e"^i + Zip - 1, r - \)e~^e~ Pe ^ 

g^ e -^e^ e -/3(p-l)e^ e -/3eg ; , Q + ^gn^ e -p( r -l)ej[ s e -P(p-l)e^ s ^fief,^ 

p\r\ * (p— l)!r! ' ' ' 

n n n 



n n n n 

P p -pAe P , i0 , Pr -pAenp-PAepj! 



n ' n n 



1 + £p-/3Ae P , i0 _i_ L -pAen i EL -pAe ile -PAe Pjil 

1 n 1 71 71 71 

P e -pAep ii0 i PL -pAen -pAe Pjil 

72 ' 72 71 

1 _|_ £ e -/3Ae P , i0 _|_ r e -/3Ae a (l + P e ~PAe P ^ 
"p^PAef^^ py^-pAeal-PAef^ 

1 + p e -^o NAP + yi e-^i (i + Pe -/3AefNAP ) > 



where in the second line we used the approximation j9 , r!( - n ? ! ! j9 _ r - )! ~ whichs hold for r << n, in 
the third we divided by n 0^e~ f3re ^ S e~^ e p S , in the third we used the identities Aep^o = ef> i0 — ep S , 
Aep^i = ep a — ep^, Ae^i = — e^, and in the last we substituted in the definitions ^ = P, 

r _ a RNAP _ a ^ a .RNAP _ a . 

APPENDIX B: NONIDENTIFIABILITY 

To see that the equations 

dxi ho + J2 J jLi^ ) ij^-keS ij Xk 

and 

^ (c6io + li)xi + X)^Li cbijlLkeSaXiXk 7 
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reduce to the same equation at any steady-state where x\ ^ 0, choose any < < 1, < bij < Cij, 
1 < j < iV, c > ^ , and calculate: 

(cb i0 + + X)^Li cbijUkeSijXiXk 



1 + CX; + X^=l CCijU keSij XiX k 



"3 = 

N N 

^=> = (c6 i0 + li)xi + ^ cbijIi k ^ s%j XiX k - jiXi(l + cx^ + ^ ccijTlkeSijXiXk) 

3=1 3=1 
N N 

= cxi(bio + ^2 bijH-keS tj Xk ~ 7*^(1 + ^ c^II^^.^)) 

iV iV 

x k i(l + ^ c^n fcG 5^.x fc )^ (provided x^ 7^ 0) 

&i0 + D^Li bijUkeSaXk 
^> ^ 7^ 

APPENDIX C: TIE-BREAKING 

Assuming that we allow only first and second-order terms, we can determine whether a given 
equation is ambiguous as follows. If it includes no self-regulation at all, it is of the simple form, and 
has a class of alternatives of the higher-order form parametrized by c > iJ b . Q • On the other hand, 
if it includes self-regulation in every term except for the constant 1 in the denominator, and the 
coefficient of X{ in the denominator is greater than 7^, then it is of the higher-order form and has 
an alternative of the simple form. 

Practically, the simplest way to make this decision is solve the optimization problem 4 twice: once 
normally, and once without allowing self-regulation (that is, adding the additional constraints that 
b^ = whenever X{ is in the jth complex). We can compare the forms of the recovered equations 
as well as the quality of the fit (i.e. the unregularized objective). If the equation is ambiguous, the 
restriction on self-regulation will have little effect on the quality of fit, since it will simply cause the 
algorithm to choose the simple alternative. Comparing the recovered equations, we will also notice 
that they are either the same, or have the relationship given by equations 5 and 6. On the other 
hand, if the equation is unambiguous, the first recovered equation will not have the form of either 
equation 5 or equation 6, and the quality of the fit will be significantly worse when self-regulation 
is restricted. This test may not always be conclusive (for example, this occurs for the Nanog and 
Gata6 equations in the noisy simulation), but if we are unsure we can always apply the derivative 
tie-breaker described below to both versions of the ambiguous equation as well the equation recov- 
ered normally, and select the form that makes the best prediction. 

In order to choose between the possible forms of an ambiguous equation (and possibly find c), 
we can measure the derivative ^ experimentally and check whether it agrees with the value pre- 
dicted by the simple form of the equation. Specifically, we can choose and perform a perturbation 
that is likely to have a major impact on the system, measure the concentrations shortly afterwards, 
and approximate the derivative by: 

dxi Xi(ti) - Xi(t ) 

-IT W) ~ 



dt t\ — to 
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(This type of experiment is not easy to carry out on a large scale, so we must choose which 
derivatives to measure with care, and do so only when necessary.) Next we predict the derivative 
following the perturbation using the simple equation. For example, if we knock out term H^Su^k 
starting from steady-state /i, the simple form predicts: 

dxi _ ho + J^j^jbijH-keSij^k 

If the measured and predicted derivatives agree, we know that the simple form is correct. Otherwise, 
we conclude that the true equation has the higher-order form, and estimate c as follows: 



(ii) 



dxj 
' dt 



(ift + H x i){ x i + Ejli CijU keSlj XiX k ) - b i() Xi - J2f=i hj^-keSijXiXk 



From a practical perspective, the measured derivative will not agree exactly with the predicted 
derivative, so if we are unsure whether we have a match, we can solve for c and determine whether 
it is possible (c > 1 J b . Q ) and reasonable. Furthermore, if we are unsure whether or not the equation 
is ambiguous, we can also predict the derivative with the equation recovered without restrictions, 
and compare this prediction with those of the two alternative equations. 

APPENDIX D: SIMULATED SIX-GENE SUBNETWORK IN MOUSE ESC 

We test our method on a synthetic network governed by the system of ODEs given in (7). The 
^S^' an d ^dt~ equations are ambiguous with the alternative forms given in (8) (provided we 

ignore the very small constant term in the equation and [G] term in the . The equation 
has the higher-order form and an alternative simple form, while the , and equations have 

the simple form and alternative higher-order forms. (Although [S] appears in every term of the 
equation, it is not ambiguous since cbio + 7^ = 0, which is impossible since c > 0, > 0.) We apply 
our method twice, once allowing self-regulation and again disallowing it. Then we compare the two 
recovered forms of each equation and their fit to the data to determine whether nonidentiflability 
exists in each case. If so, we break the tie by examining derivatives. We do this for both noiseless 
and noisy data. 



Without noise, we use a total of 52 measurements: the expression levels at each of the system 
steady-states (ESC, DSC, Endo and Trophect) and expression levels at the steady states reached 
after overexpressing each gene at twice its steady-state level, and knocking it down to one-fifth of 
its steady-state level, starting from each basic steady-state. We use cross validation (CV) to select 
the sparsity paramter A for each equation, with and without self-regulation (Figure 5). We use 
CVX software to solve the convex optimization problem (Grant and Boyd, 2011, 2008). When we 
solve without restricting self-regulation (using the sparsity parameters chosen by CV), we recover 
the following equations (with coefficients thresholded at 0.1% of the largest recovered coefficient): 



16 



Table 1 

Quality of fit (unregularized objective value) for noiseless data 



Equation 


unrestricted solution 


no self-regulation 


Oct4 


1.249 x 10~ b 


5.7674 


Sox2 


1.1226 x 1(T 9 


0.4269 


Nanog 


6.7426 x 10" 8 


0.7664 


Cdx2 


3.5627 x 10" 7 


9.7365 x 10" 7 


Gcnf 


1.2954 x 10~ 7 


2.130 x 10~ 7 


Gata6 


7.7505 x 10" 7 


7.9072 x 10" 5 



& = M + 0.001 + (0.005[O][S] + 0.025[O][5][iV]) (0 01% threshold) 

dt l + [A] + (0.001[O]+0.005[O][5]+0.025[O])[5][JV] + 10[O][C] + 10[Gc] L J 1 ' 
d[S] _ 0.001 + 0.005[O][S] + 0.025 [0][S][iV] 



dt 1 + 0.005 [O] [S] +0.025 [O] [S] [N] 
d[N] _ Q.l[O}[S} + 0.l[O}[S}[N] 

~dt~~ ~ 1 + 0.1 [0][S] + 0.1[O][S][iV] + 10[O][G] 
d[C] _ 0.95 



0.1[S] 
- 0.1 [AT] 



0.1[C] 

nA -i- n m \a\ \aA 

0.1 [Gc] 

0.1[G] 



cit 1 + 2.5[0] 
d[Gc] _ 0.1 [Gc] + 0.01[C][Gc] + 0.01[G][Gc] 

" 1 + 0.1 [Gc] + 0.01 [G][Gc] + 0.01 [G][Gc 
d[G] _ 0.1 + 0.95[O] 



dt 1 + 0.95[O] + U.25[N] + 0.04[5] [N] + 0.08[N] [C] + 0.02[AT] [Gc] + 0.08[N] [G] 



When we solve the same problem, disallowing self-regulation (again using the appropriate CV 
sparsity parameters), we recover the following equations. 

d[0] = [A] + 0.14[G] + 0.57[G] + 0.12[5][G] + 0.04[iV][G] + 0.20[A^] [Gc] 
dt ~ l + [A]+20[G]+9.6[Gc] + 3.3[G] + 0.33[5][G]+0.04[AT][C]+0.20[Af][Gc] ' [ J 
d[S]__ 0A8\g\[N] 
dt l + 0.18[O][iV] 1 J 
d[N] _ 0.01 + 0.12[O][g] 
dt l + 0.12[O][5] 1 J 

d[G] _ 0.95 



dt 



T^W]-°- 1[C] 

d[Gc] _ 0.001 + 0.1[G]+0.1[G] 

1 + 0.1[G] + 0.1[G] 11 

d[G]_ 0-1 + [0] 

l + [O] + 0.03[iV][Gc] + 15[iV] M 



We measure the quality of the fit by the unregularized objective value \\Gibi — 7-DjGjCj|| from 
equation 4 for each recovered equation. These values are given in Table 1. We can tell that first 
three equations are unambiguous, while the last three have alternative forms, since the quality of 
fit is roughly the same for the last three equations whether or not we restrict self-regulation, while 
for the first three, the fit is much worse when self-regulation is prohibited. Therefore, we can use 
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Figure 5: Cross validation (8- fold) on noiseless data. We estimate the test error for each gene equation 
and various choices of sparsity parameter by randomly dividing the 52 observations into 8 folds (groups), 
then leaving out each fold out in turn, training the model on the remaining 7 folds, testing on the omitted 
fold, and finally averaging the 8 resulting test errors. After repeating this process for each gene equation and 
several choices of sparsity parameter, we select the sparsity parameters corresponding to the lowest error 
for each equation, (a) Unrestricted case: we selected sparsity parameters [10 _1 , 10 -5 , 10 -6 , 10 -4 , 10 -2 , 10 -6 ]. 
(b) No self-regulation: we selected sparsity parameters [10 -2 , 10 -5 , 10 _1 , 10 _1 , 10 -4 , 10 -5 ]. 



the recovered forms of the first three equations, but we need to break ties between alternative forms 
of the last three equations. 

It's easiest to start with the simple forms of the last three equations, and determine the corre- 
sponding higher-order forms. To break the tie, we look at derivatives following Oct4, Cdx2, and 
Nanog knockouts, respectively, since these regulators are important in each of the three ambiguous 
equations (Figure 6). After each knockout we estimate the derivative with a finite difference, com- 
pare it to the derivative prediction made by the simple form of the equation, and accept this form 
if the match is good. Otherwise we compute c using equation 11, and (provided it is reasonable), 
accept the higher-order form with this choice of c. 

In this case, we estimate « 0.6004825 immediately after Oct4 knockout from ESC steady- 
state. The simple form of the equation yields « 0.78 immediately following the knockout, 
which is a poor match. Therefore we select the higher-order form and use the measured derivative 
to compute c = 2, which yields = 0.60. For the equation we measure « —0.13 

immediately after Cdx2 knockout from SC, and the simple form is a good match at —0.13 (com- 
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Figure 6: Derivative measurements to break ties between alternative forms, (left) Trajectory of Cdx2 
following Oct4 knockout from ESC: ^p(^o) ~ 0.6004825. (center) Gcnf trajectory following Cdx2 knockout 
from ESC: ^§^(t ) « -0.1261507. (right) Gata6 trajectory following Nanog knockout from ESC: ^(t ) ~ 
0.6908821. 



puting c for the higher-order form using the derivative yields an unreasonably large c ^ 86; using 
the minimum value c = 0.1 in the higher-order form yields — —0.017). Similarly, for the 

equation we measure ^j— ~ 0.69 immediately after Nanog knockout from SC, and the simple form 
is a good match at 0.69. In the end, we select the equations given in (9). 

Next we test the algorithm using noisy data by adding zero-mean Gaussian noise to each measure- 
ment, with standard deviation 1% of the measurement magnitude. We use the 4 basic steady-states, 
the steady states reached after knockdown and overexpression of each individual gene from each 
basic steady state, and those reached after knocking one gene up and one gene down from each pair 
of genes, starting from ESC and DSC. Again, we use cross validation to select the sparsity parame- 
ters (Figure 7). When we solve the problem with noisy data (with no restriction on self-regulation) 
and threshold at a level of 1% of the largest recovered coefficient, we recover: 

d M = M .i[O] 

dt l + [A]+9.9[Gc]+9.9[0][G] L J 
d[S] __ 0.001 [O] [S] + 0.0005[S][iV] + 0.025[O][S][iV] 



dt 1 + 0.001 [O] [S] + 0.0005[5][AT] + 0.025[O][S][iV] 
d[N] __ 0M[O][S\[N] 



0.1[S] 



dt 1 + 0.1[G][Gc] + 0.09[O}[S}[N] + 9.1[0][G] 
d[C] __ 0.94 



0.1 [AT] 



0.1[C] 

nA -i- n m \a\ \aA 

0.1 [Gc] 

71 

0.1 [G] 



dt 1 + 2.4[0] 

d[Gc] __ 0.1 [Gc] + 0.01[C}[Gc\ + 0M[G][Gc] 

dt ~~ 1 + 0.1 [Gc] + 0.01 [C][Gc] + 0.01 [G][Gc 

d[G] __ 0.1[G] +0.1 [Q][G] 



dt 1 + 0.2[N] + 0.1 [G] + 0.05[O][AT] + 0.1 [Q][G] + 0M[S][N] + 0.025[N][C] + 1.4[JV][G] 



19 




X 



Oct4 Sox2 Nanog Cdx2 Gcnf Gata6 

rt c c c n h qq -7 




X 



Figure 7 : Cross validation (8-fold on 108 observations, with the approach described in Figure 5) on noisy 
data (1% Gaussian noise): (left) unrestricted: we selected sparsity parameters [0.1,1,0.01,0.01,0.1,0.00001] 
(right) No self-regulation: we selected [0.1,1,0.01,0.001,0.1,0.01]. 



When we solve without allowing self-regulation, we recover: 
d[0] [A]+0.3[G] 



dt l + [A\ + 15.4[C] + 9.7[Gc] + 3.1 [G] + 0.5[S][C] + 0.6[7V][C] 
d[S] 0.2[O][N] 



0.1[O] 



dt l + 0.2[O][7V] 
d[N] 0.03 + 0.17[5] + 0.03[S][C] 



dt 1 + 0.17[5] + 0.03[5][C] 
d[C] 0.95 



0A[S] 

'i \n\ 

0.1 [iV] 



l + 2.5[0] 
rf[Gc] 0.1 [C] +0.1[G] 



1 + 0.1[C]+0.1[G] 
rf[G] 0.1 + 0.9[O] 



0.1[C] 

-0.1[Gc] 
0.1 [G] 



l + 0.9[O] + 14.2[7V] 

Table 2 shows that for noisy data, the quality of fit does not indicate as clearly which equations are 
ambiguous. For the Oct4 and Sox2 equations, the quality of fit drops dramatically when we restrict 
self-regulation, while it changes very little for Cdx2, Gcnf and Gata6, revealing that the Oct4 and 
Sox2 equations are correct, while Cdx2, Gcnf and Gata6 have a simple and a higher-order form. 
We break the tie between the two forms of the last three equations using derivatives as before. The 
Nanog equation is unclear, so we analyze derivatives to decide between the two alternative forms 
and the solution of the unrestricted optimization problem. First we observe that the higher-order 
version of the ambiguous form is illegal as it contains third-order terms, so we only need to choose 
between the unrestricted equation and the simple equation recovered without self-regulation. We 
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Table 2 

Quality of fit (unregularized objective value) for noisy data 



Equation 


unrestricted 


no self-reg 


Oct4 


1.5170 


8.1241 


Sox2 


0.2800 


1.006 


Nanog 


0.6877 


1.9347 


Cdx2 


0.5634 


0.8208 


Gcnf 


0.0278 


0.0599 


Gata6 


0.1278 


0.2224 



simulate the trajectory after Gata6 knockout from ESC and compare the derivative (-jjt = 0.023) 
to the predictions of the first equation (^p = 0.044) and the simple version (^p = —0.070), 
concluding that the first equation is correct. Finally, we obtain the equations given in equation 10. 
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